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Abstract. In simulations of supernovae, linear systems of equations 
with a block penta-diagonal matrix possessing small, dense matrix blocks 
occur. For an efficient solution, a compact multiplication scheme based on 
a restructured version of the Thomas algorithm and specifically adapted 
routines for LU factorization as well as forward and backward substi- 
tution are presented. On a NEC SX-8 vector system, runtime could be 
decreased between 35% and 54% for block sizes varying from 20 to 85 
compared to the original code with BLAS and LAPACK routines. 
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1 Introduction 


Neutrino transport and neutrino interactions in dense matter play a crucial 
role in stellar core collapse, supernova explosions and neutron star formation. 
The multidimensional neutrino radiation hydrodynamics code PROMETHEUS / 
VERTEX [I] discretizes the angular moment equations of the Boltzmann equa- 
tion giving rise to a non-linear algebraic system. It is solved by a Newton 
Raphson procedure, which in turn requires the solution of multiple block-penta- 
diagonal linear systems with small, dense matrix blocks in each step. This is 
achieved by the Thomas algorithm and takes a major part of the overall com- 
puting time. Since the code already performs well on vector computers, this kind 
of architecture has been the focus of the current work. 

The Thomas algorithm [23] is a simplified form of Gaussian elimination with- 
out pivoting, as originally applied to tridiagonal systems. Bieniasz [4] gives a 
comprehensive overview of the numerous adaptations for special cases and mu- 
tations of tridiagonal systems, the extensions to cyclic tridiagonal systems and 
the transfer to block tridiagonal matrices. However, an efficient implementation 
for block penta-diagonal systems has not yet been considered in the literature. 
An additional challenge consists in solving systems with relatively small matrices 
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as used in the Thomas algorithm. Optimizations of LAPACK routines usually 
target large matrices as necessitated for example for the HPL benchmark [5]. 

The remainder of the paper is organized as follows: after introducing the 
Thomas algorithm and the concept of vector architectures, section 2 presents 
the compact multiplication scheme. Section 3 explains implementation issues for 
the newly introduced scheme and the LU decomposition and states the results 
obtained, followed by the summary of the paper in section 4. 


1.1 Thomas Algorithm 


Consider a linear system of equations consisting of a block penta-diagonal (BPD) 
matrix with n blocks of size k x k in each column resp. row, a solution vector 
x and a right hand side (RHS) f. The vectors z = (afa}...x7)" and f = 
(ai. a ar are each of dimension k- n. The BPD system is defined by 


AiXj—2 a Biz; Bg Cia, a Diziyi T EiZi 19 = fp Lis (1) 


by setting, for ease of notation, Aj By Ao En-1 Dn E,, = 0, and 
implementing z and f as (x7, z ... £T) and cm i . af) 

Eliminating the sub-diagonal matrix blocks A; and B; and inverting the diag- 
onal matrix blocks C; would result in the following system: 


z+ Yi Zipi T Zi Zing T Lis 1<i<n-2, 
Ln—1 T Yn-1 n =n- (2) 
Tn = Ene 


In the Thomas algorithm, the new components Y;, Z; and r; are computed 
by substituting xz; 5 and z;_, in (I) using the appropriate equations of (2) and 
comparing coefficients. This results in 


Y= G (D; = K;Zi_1) 
Z,=G, E; a 8) 
t = G7! (f; — Åifi—2 — Kir;_1) 


where 
K; = B;— AiYi-2 
G; = Ci— KiYi-1 — AiZi-2 


assuming again, for ease of notation, Y-ı = Z_1 = Yo = Zo = 0. Then the 
solution is computed using backward substitution, i.e. 


Tn = Tns 
Zn—1 = Pn-1 7 Yn-1 Ln (5) 
ti S=ri — Yi tiy “Zizi t=n—2,-1,1. 


Since in practice, the inversion of G; in (3) is replaced by a LU-decomposition 
G; = L;U;, it’s crucial to understand that solving a BPD system includes two 
levels of Gaussian elimination (GE) and backward substitution (BS): one for the 
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whole system and one for each block row. The GE of the whole system (GES) 
requires the calculation of K; and G; shown in (4). GE and BS are then applied to 
each block row (GE_R / BS_R) computing Y;, Z; and r; in (B). Finally, backward 
substitution () is applied to the whole system (BS_S) to obtain the solution z. 


1.2 Vector Architecture 


Two different kinds of computer architectures are available today: scalar and 
vector. They use different approaches to tackle the common problem, memory 
latencies and memory bandwidth. Vector architectures [6[7] use SIMD instruc- 
tions in particular also for memory access to hide memory latencies. This re- 
quires large, independent data sets for efficient calculation. The connection to 
main memory is considerably faster than for scalar architectures. Since there are 
no caches, data are directly transfered to and from main memory. Temporary 
results are stored in vector registers (VRs) of hardware vector length VL, which 
is 256 on the NEC SX-8 [89]. As not all instructions are vectorizable, scalar 
ones are integrated into vector architectures. Because of their slow execution 
compared with commodity processors their use should be avoided by improving 
the vectorization ratio of the code. 


2 Reordering the Computational Steps 


For operations on small matrix blocks, memory traffic is the limiting factor for 
total performance. Our approach involves reordering the steps in (B) and (4) as 


K; = B; — A; Y;_2 G; = Gi — K; Yii Y= Gt - H; 
G; = C; — A; Zi-2 H; = Dj — K Zi-1 Z;=G;'- Æ; (6) 
r= f, — Ai ri—2 m=O — Kiri r= G;* eg 


where the following spaces can be shared: B; and K;; C;, G; and G;; D;, H; and Y;; 
E; and Z;, as well as f, ri, r! and r;. The improvements of this rearrangement 
are valuable for the GE of the whole system as well as for the solution of the block 
rows. First, during GE_S, A; and K; are loaded only k times from main memory 
instead of 3k times as is the case with a straight forward implementation. Sec- 
ond, by storing H;, E; and r% contiguously in memory, the inverse of G is applied 
simultaneously to a combined matrix of size k x (2k + 1) during GE_R / BS_R. 
Third, by supplying a specifically adapted own version of LAPACK’s xGETRF 
(factorization) and xGETRS (forward and backward substitution) routines [10], 
memory traffic is further reduced. By combining factorization and forward sub- 
stitution, L; is applied to the RHS vectors during the elimination process and 
therefore not reloaded from main memory. 


3 Implementation Details and Numerical Results 


In this section, the performance of our new approach presented in section P] 
is compared to the standard approach which uses LAPACK’s xGETRF and 
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xGETRS routines to compute the LU-factorization and apply forward / back- 
ward substitution and the BLAS routines xGEMM and xGEMV [O03] for 
matrix-matrix products and matrix-vector products, respectively. All compu- 
tations were carried out on a NEC SX-8 vector computer using the proprietary 
Fortran90 compiler version 2.0 / Rev. 340 [I3]. 

After presenting the results for the compact multiplication scheme used for 
GE.S, different implementations of GER and BS_R are analyzed. Since the 
number of RHS vectors is 2k + 1 and therefore depends on the block size k, a 
small, a medium and a large test case with block sizes of k = 20, 55 and 85 are 
selected representatively. After that, the performance of the solution process for 
one block row as well as for the whole system (I) is described. 


3.1 Applying Gaussian Elimination to the Whole System 


The first two systems of (6) are of the form 


D=B+A-C, 
G=E+A-F, (7) 
z=utA-w 


Using BLAS, each line is split into two operations, one copy operation, e.g. 
D = B, and one call to xGEMM or xGEMV. Instead, if A should only ” once” be 
loaded from main memory, the system (7) has to be treated as a single entity 
and xGEMM or xGEMV can not be used any longer since neither the matrices 
nor the vectors are stored contiguously in memory. 

Our implementation listed below is a compact multiplication scheme. It con- 
tains two parts, the first one calculating the first column of D and G as well as 
the whole vector z, the second computing the remaining columns of D and G. 
The loop in 7 is stripmined to work on vectors smaller or equal to the hardware 
vector length VL. 


do is = i, k, VL 
ie = min( is+VL-1, k ) 


! start with vector z and first columns of D andG 
save v(is:ie), B(is:ie,1) and E(is:ie,1) to VRs vx, vmi and vm2 
dol=1,k 
store w(1), C(1,1) and F(1,1) to scalars vali, val2 and val3 
multiply A(is:ie,1) with vali, val2 and val3 and save 
results to VRs vxt, vmit and vm2t 
add VRs vxt(is:ie), vmit(is:ie) and vm2t(is:ie) to VRs 
vx(is:ie), vmi(is:ie) and vm2(is:ie) 
end do 
store VRs vx, vmi and vm2 toz(is:ie), D(is:ie,1) and G(is:ie,1) 


! do rest of columns of D and G 
do j = 2, k 
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save B(is:ie,j) and E(is:ie,j) to VRs vmi and vm2 
dol=1,k 
store C(1,j) and F(1,j) to scalars vall and val2 
multiply A(is:ie,1) with vali and val2 and save results 
to VRs vmit and vm2t 
add VRs vmit(is:ie) and vm2t(is:ie) to VRs vmi(is:ie) and 
vm2(is:ie) 
end do 
store VRs vmi and vm2 to D(is:ie,j) and G(is:ie,j) 
end do 
end do 


Tbl. [] compares the performance of the BLAS code to the introduced com- 
pact multiplication scheme. The latter leads to an average decrease of 11 — 31% 
in execution time. It is considerably faster for small and medium block sizes, 
whereas for large block sizes, its performance is approached by BLAS. 


Table 1. Execution time of different implementations solving for 150000 calls 


k= 20 30 40 50 60 70 80 90 


LAPACK [s] 2.23 4.65 7.68 11.97 16.92 22.49 28.44 37.99 
comp. mult. [s] 1.54 3.25 5.93 9.07 13.19 18.49 25.28 33.71 
decr. in runtime [%] 31.1 30.1 22.9 242 221 178 111 11.3 


3.2 Applying Gaussian Elimination and Backward Substitution to 
One Block Row 


Gaussian Elimination. The search of the pivot element and exchange of ma- 
trix rows is not very costly. Measurements on the NEC SX-8 indicated that the 
search for the maximum value in the pivot column using Fortran’s intrinsic func- 
tion maxval and then looping until the correct index has been found is generally 
faster than a search over the whole pivot column. 

The performance of the different versions of GE is depicted in figure Blana B] 
The simplest implementation of row-oriented GE with immediate update for 
a given step j consists of just two loops in i (column index) and / (row index) 
updating the remaining parts of the matrix and RHS vectors. The loop in l over 
the remaining rows is the longer one and therefore the innermost. This loop 
order is switched by the compiler which naturally degrades performance. The 
next variant ”val + UR(4)”, introducing a temporary scalar for elements of the 
pivot row, still with the compiler’s switching of loops, leads to an unrolling of 
depth 4 in l causing strided memory access. The compiler directive select (vector) 
on the SX-8, normally used for parallel constructs, provides a simple means to 
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Fig. 1. Performance for GE_R for k = 20 Fig. 2. Performance for GE_R for k = 55 


enforce the correct order of the loops. This version ”val + OUR(4)” with a 
temporary scalar now has a compiler-generated outer unrolling in 7 with depth 
4. The remaining two variants ”val + vreg + OUR(4)” and ”val + vreg + 
OUR(8)” use VRs for the pivot row, a temporary scalar for elements of the 
pivot column as well as outer unrolling in i with depths 4 or 8. As can be seen 
from these figures, the correct loop order is essential for good performance. Using 
temporary scalars for elements of the pivot column, a VR for elements of the 
pivot row and outer unrolling in ¿į of depth 8 leads essentially to the highest 
MFlop rates on the tested vector system. 


Backward Substitution. The results for the different versions of row-wise 
backward substitution explained below are shown in figure Ø| [B] and [6] The 
simplest version for a given step j also consists of just two loops. The obvious 
first improvement, introducing a temporary scalar for elements of the ” pivot row” 
leads to compiler-generated inner unrolling with depth 9993 and is therefore not 
shown. The second variant ” val + OUR(4)” imposes outer unrolling with depth 4. 
The remaining variants ” val + vreg + OUR(8)”, ”val + vreg + OUR(16)” and 
*val + vreg + OUR(4/16)” use one VR for elements of the pivot row, temporary 
scalars for elements of the pivot column as well as outer unrolling of different 
depths. Apparently, the fastest version uses a VR for elements of the pivot row, 
temporary scalars for elements of the pivot column and explicitly coded, yet 
compiler-generated, outer unrolling of depth 16 as long as possible and of depth 
4 for the remaining columns. 


Performance Results. Using the optimal implementation for GER, / BS_R, 
runtimes on the NEC SX-8 may be reduced by the following factors (for 50000 
calls) compared to the LAPACK version: by 61.5% from 5.04s to 1.94s for k = 20, 
by 43.3% from 23.98s to 13.59s for k = 55 and by 41.2% from 54.54s to 32.08s 
for k = 85. 
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Fig. 3. Performance for GE_R for k = 85 Fig. 4. Performance for BS_R for k = 20 
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Fig. 5. Performance for BS_R for k = 55 Fig. 6. Performance for BS_R for k = 85 


3.3 Solution of a Sample BPD System 


The compact multiplication scheme and the improvements of the solution of 
the block rows are integrated into a new BPD solver. Its execution times are 
compared to a traditional BLAS/LAPACK solver in table P] for 100 systems 
with block sizes k = 20,55 and 85 and n = 500. 

Except for the first version using bandwise storage, the diagonals are stored 
as five separate vectors of matrix blocks. If H;, Z; and r; in (G) are stored 
contiguously in memory, only one call to xGETRS is needed instead of three. 


Table 2. Execution times for BPD solver 


k= 20 55 85 


11.63 36.79 79.70 
8.33 40.22 85.26 


] 
BLAS + 3 LAPACK calls [s] 
[İs] 643 33.80 76.51 


s 
BLAS + 1 LAPACK call (x) 

comp. mult. + new solver (xx) 
decr. in runtime between (x) and (xx) [% 


3.79 23.10 55.10 
54.5 42.6 35.4 
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4 Summary 


The solution of a linear system of equations with a BPD matrix with block 
sizes ranging from 20 to 85 was investigated. A substitute for xGEMM and 
xGEMV based on a restructured version of the Thomas algorithm as well as 
specifically adapted versions of LAPACK’s xGETRF and xGETRS routines were 
implemented. Best results were obtained with a compact multiplication routine 
for the global system and a combined factorization and forward substitution 
scheme for each block row. The latter uses temporary scalars for elements of the 
pivot column, a vector register for elements of the pivot row and outer unrolling 
of depth 8 during Gaussian elimination and of depth 16 and 4 during backward 
substitution. For a NEC SX-8 vector system a decrease in total runtime between 
35% and 54% for test cases of block size 20, 55 and 85 compared to the original 
code using BLAS and LAPACK was achieved. 

An efficient implementation for scalar architectures will be the subject of 
further research. 
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